Studying string handling in R

Source: http://www.di.fc.ul.pt/~jpn/r/


Standard functions

  • paste : pastes vectors together
  • substr : extracts/replaces substrings in a character vector
  • substring : like substr, but recycles its arguments to return several results
  • strsplit : splits each element into substrings at the matches of a pattern (uses regular expressions); see the quick tour below
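
A quick tour of the four functions before the individual examples (a minimal sketch added for illustration; x is a throwaway variable):

In [ ]:
x <- "hello world"
paste("say:", x)        # "say: hello world"
substr(x, 1, 5)         # "hello"
substring(x, 1, 1:5)    # "h" "he" "hel" "hell" "hello"
strsplit(x, " ")[[1]]   # "hello" "world"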


In [3]:
# date() returns the current date and time as a string
paste("Today is", date())


'Today is Mon Sep 04 16:22:23 2017'

In [5]:
# paste0 : paste with sep = "", i.e. no separator between the pieces
xs <- 1:7
paste0("A", xs)


  1. 'A1'
  2. 'A2'
  3. 'A3'
  4. 'A4'
  5. 'A5'
  6. 'A6'
  7. 'A7'

In [6]:
paste("A", xs, sep = ",")


  1. 'A,1'
  2. 'A,2'
  3. 'A,3'
  4. 'A,4'
  5. 'A,5'
  6. 'A,6'
  7. 'A,7'

In [13]:
letters[1:10] # built-in constant: the lowercase alphabet
paste(letters[1:10], xs, sep = "|") # xs (length 7) is recycled to length 10


  1. 'a'
  2. 'b'
  3. 'c'
  4. 'd'
  5. 'e'
  6. 'f'
  7. 'g'
  8. 'h'
  9. 'i'
  10. 'j'
  1. 'a|1'
  2. 'b|2'
  3. 'c|3'
  4. 'd|4'
  5. 'e|5'
  6. 'f|6'
  7. 'g|7'
  8. 'h|1'
  9. 'i|2'
  10. 'j|3'

In [16]:
paste(letters[1:10], xs, sep = "|", collapse = ",")


'a|1,b|2,c|3,d|4,e|5,f|6,g|7,h|1,i|2,j|3'
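
To recap the two arguments (an added note, not from the source): sep joins element-wise across the input vectors, while collapse then fuses the resulting vector into a single string.

In [ ]:
paste(c("x","y"), 1:2, sep = "-")                  # "x-1" "y-2"  (two strings)
paste(c("x","y"), 1:2, sep = "-", collapse = "+")  # "x-1+y-2"    (one string)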

In [34]:
cs <- "o mapa nao e o territorio"
paste0(cs, ", tem ", nchar(cs), " characteres")


'o mapa nao e o territorio, tem 25 characteres'

In [35]:
# extract the substring between the start and end positions
substr(cs, 3, 6)


'mapa'

In [36]:
# replace the characters at those positions
substr(cs, 3, 6) <- "MAPA"
cs


'o MAPA nao e o territorio'

In [42]:
# substr returns a single result;
# substring recycles its arguments and returns several
substring(cs, 2, 4:6)


  1. ' MA'
  2. ' MAP'
  3. ' MAPA'

In [48]:
# strsplit(x, pattern)
# split the string at every 'o' or 'a'
cs <- "o mapa nao e o territorio"
strsplit(cs,"[oa]")


    1. ''
    2. ' m'
    3. 'p'
    4. ' n'
    5. ''
    6. ' e '
    7. ' territ'
    8. 'ri'

In [45]:
cs <- paste(letters[1:10],1:7,sep="|",collapse=",")
cs


'a|1,b|2,c|3,d|4,e|5,f|6,g|7,h|1,i|2,j|3'

In [47]:
# split the string on ',' and '|'
cs1 <- strsplit(cs,"[,|]")[[1]]
cs1


  1. 'a'
  2. '1'
  3. 'b'
  4. '2'
  5. 'c'
  6. '3'
  7. 'd'
  8. '4'
  9. 'e'
  10. '5'
  11. 'f'
  12. '6'
  13. 'g'
  14. '7'
  15. 'h'
  16. '1'
  17. 'i'
  18. '2'
  19. 'j'
  20. '3'

In [49]:
cs1 <- paste0(cs1,collapse="")
cs1


'a1b2c3d4e5f6g7h1i2j3'

Regular Expressions



In [52]:
# split the string on the digits 1-9
strsplit(cs1,"[1-9]")


    1. 'a'
    2. 'b'
    3. 'c'
    4. 'd'
    5. 'e'
    6. 'f'
    7. 'g'
    8. 'h'
    9. 'i'
    10. 'j'

In [56]:
# . matches any single character, so every position is a split
# point and only empty strings remain
strsplit("a.b.c.d", ".") 
# .d : any character followed by 'd' (here the final '.d')
strsplit("a.b.c.d", ".d")


    1. ''
    2. ''
    3. ''
    4. ''
    5. ''
    6. ''
    7. ''
  1. 'a.b.c'

In [67]:
# to split on a literal '.', escape it as \\.
strsplit("a.b.c", "\\.")


    1. 'a'
    2. 'b'
    3. 'c'
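
An alternative worth knowing (an added note): strsplit also accepts fixed = TRUE, which turns off regular-expression interpretation entirely, so no escaping is needed.

In [ ]:
strsplit("a.b.c", ".", fixed = TRUE)  # "a" "b" "c"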

In [59]:
cs <- c("aaa","abb","ccc","dda","eaa")

In [66]:
# sub : replaces only the first match in each element
sub("a", "X", cs)


  1. 'Xaa'
  2. 'Xbb'
  3. 'ccc'
  4. 'ddX'
  5. 'eXa'

In [65]:
# gsub : replaces every match in each element
gsub("a", "X", cs)


  1. 'XXX'
  2. 'Xbb'
  3. 'ccc'
  4. 'ddX'
  5. 'eXX'

In [69]:
text.test <- "Evidence for a model (or belief) must be considered against alternative models. Let me describe a neutral (and very simple) example: Assume I say I have Extra Sensorial Perception (ESP) and tell you that the next dice throw will be 1. You throw the dice and I was right. That is evidence for my claim of ESP. However there's an alternative model ('just a lucky guess') that also explains it and it's much more likely to be the right model (because ESP needs much more assumptions, many of those in conflict with accepted facts and theories). This is a subject of statistical inference. It's crucial to consider the alternatives when we want to put our beliefs to the test."
text.test


'Evidence for a model (or belief) must be considered against alternative models. Let me describe a neutral (and very simple) example: Assume I say I have Extra Sensorial Perception (ESP) and tell you that the next dice throw will be 1. You throw the dice and I was right. That is evidence for my claim of ESP. However there\'s an alternative model (\'just a lucky guess\') that also explains it and it\'s much more likely to be the right model (because ESP needs much more assumptions, many of those in conflict with accepted facts and theories). This is a subject of statistical inference. It\'s crucial to consider the alternatives when we want to put our beliefs to the test.'

In [70]:
# replace every occurrence of 'belief' or 'model' with XXX
gsub("belief|model","XXX",text.test)


'Evidence for a XXX (or XXX) must be considered against alternative XXXs. Let me describe a neutral (and very simple) example: Assume I say I have Extra Sensorial Perception (ESP) and tell you that the next dice throw will be 1. You throw the dice and I was right. That is evidence for my claim of ESP. However there\'s an alternative XXX (\'just a lucky guess\') that also explains it and it\'s much more likely to be the right XXX (because ESP needs much more assumptions, many of those in conflict with accepted facts and theories). This is a subject of statistical inference. It\'s crucial to consider the alternatives when we want to put our XXXs to the test.'

In [72]:
# t([a-z]*)?t : a 't', then zero or more lowercase letters,
# then another 't'; each such span becomes XXX
gsub("t([a-z]*)?t","XXX",text.test)


'Evidence for a model (or belief) must be considered against alXXXive models. Let me describe a neutral (and very simple) example: Assume I say I have Extra Sensorial Perception (ESP) and tell you XXX the next dice throw will be 1. You throw the dice and I was right. That is evidence for my claim of ESP. However there\'s an alXXXive model (\'just a lucky guess\') XXX also explains it and it\'s much more likely to be the right model (because ESP needs much more assumptions, many of those in conflict with accepted facts and theories). This is a subject of sXXXical inference. It\'s crucial to consider the alXXXives when we want to put our beliefs to the XXX.'
  • [a-z]+ : one or more occurrences of [a-z]
  • [a-z]* : zero or more occurrences of [a-z]
  • [a-z]? : zero or one occurrence of [a-z] (see the sketch below)
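
A minimal sketch showing the three quantifiers side by side (added for illustration):

In [ ]:
gsub("ab+", "X", "a ab abb")  # "a X X"  : at least one b is required
gsub("ab*", "X", "a ab abb")  # "X X X"  : zero b's also matches
gsub("ab?", "X", "a ab abb")  # "X X Xb" : at most one b is consumed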

In [76]:
# collapse doubled letters with YY:
# ([a-z])\\1 matches a lowercase letter followed by
# the same letter again (backreference), e.g. ee, ss
gsub("([a-z])\\1","YY",text.test)


'Evidence for a model (or belief) must be considered against alternative models. Let me describe a neutral (and very simple) example: AYYume I say I have Extra Sensorial Perception (ESP) and teYY you that the next dice throw wiYY be 1. You throw the dice and I was right. That is evidence for my claim of ESP. However there\'s an alternative model (\'just a lucky gueYY\') that also explains it and it\'s much more likely to be the right model (because ESP nYYds much more aYYumptions, many of those in conflict with aYYepted facts and theories). This is a subject of statistical inference. It\'s crucial to consider the alternatives when we want to put our beliefs to the test.'

In [77]:
# wrap every 'model' in asterisks; \\1 is the captured text
gsub("(model)","*\\1*",text.test)


'Evidence for a *model* (or belief) must be considered against alternative *model*s. Let me describe a neutral (and very simple) example: Assume I say I have Extra Sensorial Perception (ESP) and tell you that the next dice throw will be 1. You throw the dice and I was right. That is evidence for my claim of ESP. However there\'s an alternative *model* (\'just a lucky guess\') that also explains it and it\'s much more likely to be the right *model* (because ESP needs much more assumptions, many of those in conflict with accepted facts and theories). This is a subject of statistical inference. It\'s crucial to consider the alternatives when we want to put our beliefs to the test.'

In [80]:
# turn every 'vid' into 'idv' by permuting the three capture groups
gsub("(v)(i)(d)","\\2\\3\\1", text.test)


'Eidvence for a model (or belief) must be considered against alternative models. Let me describe a neutral (and very simple) example: Assume I say I have Extra Sensorial Perception (ESP) and tell you that the next dice throw will be 1. You throw the dice and I was right. That is eidvence for my claim of ESP. However there\'s an alternative model (\'just a lucky guess\') that also explains it and it\'s much more likely to be the right model (because ESP needs much more assumptions, many of those in conflict with accepted facts and theories). This is a subject of statistical inference. It\'s crucial to consider the alternatives when we want to put our beliefs to the test.'

In [82]:
# ([^a-zA-Z]) : a non-letter character (keeps the word boundary)
# ([aA][a-z]+) : a word beginning with 'a' or 'A'
# \\1%\\2* : put the boundary (\\1) back, then the word (\\2)
#            wrapped in % and *
gsub("([^a-zA-Z])([aA][a-z]+)","\\1%\\2*",text.test)


'Evidence for a model (or belief) must be considered %against* %alternative* models. Let me describe a neutral (%and* very simple) example: %Assume* I say I have Extra Sensorial Perception (ESP) %and* tell you that the next dice throw will be 1. You throw the dice %and* I was right. That is evidence for my claim of ESP. However there\'s %an* %alternative* model (\'just a lucky guess\') that %also* explains it %and* it\'s much more likely to be the right model (because ESP needs much more %assumptions*, many of those in conflict with %accepted* facts %and* theories). This is a subject of statistical inference. It\'s crucial to consider the %alternatives* when we want to put our beliefs to the test.'

In [92]:
gsub("([^a-zA-Z])([a-z]){1,3}([^a-zA-Z])","\\1ZZZ\\3",
     text.test)


'Evidence ZZZ a model (ZZZ belief) must ZZZ considered against alternative models. Let ZZZ describe ZZZ neutral (ZZZ very simple) example: Assume I ZZZ I have Extra Sensorial Perception (ESP) ZZZ tell ZZZ that ZZZ next dice throw will ZZZ 1. You throw ZZZ dice ZZZ I ZZZ right. That ZZZ evidence ZZZ my claim ZZZ ESP. However there\'ZZZ an alternative model (\'just ZZZ lucky guess\') that also explains ZZZ and ZZZ\'s much more likely ZZZ be ZZZ right model (because ESP needs much more assumptions, many ZZZ those ZZZ conflict with accepted facts ZZZ theories). This ZZZ a subject ZZZ statistical inference. It\'ZZZ crucial ZZZ consider ZZZ alternatives when ZZZ want ZZZ put ZZZ beliefs ZZZ the test.'

In [94]:
# split on any of: comma, period, colon, space, parentheses, quote
separators <- "[,.: ()']"
tokens <- strsplit(text.test, separators)[[1]]  
# "" 제거
tokens <- tokens[tokens != ""]                  
tokens


  1. 'Evidence'
  2. 'for'
  3. 'a'
  4. 'model'
  5. 'or'
  6. 'belief'
  7. 'must'
  8. 'be'
  9. 'considered'
  10. 'against'
  11. 'alternative'
  12. 'models'
  13. 'Let'
  14. 'me'
  15. 'describe'
  16. 'a'
  17. 'neutral'
  18. 'and'
  19. 'very'
  20. 'simple'
  21. 'example'
  22. 'Assume'
  23. 'I'
  24. 'say'
  25. 'I'
  26. 'have'
  27. 'Extra'
  28. 'Sensorial'
  29. 'Perception'
  30. 'ESP'
  31. 'and'
  32. 'tell'
  33. 'you'
  34. 'that'
  35. 'the'
  36. 'next'
  37. 'dice'
  38. 'throw'
  39. 'will'
  40. 'be'
  41. '1'
  42. 'You'
  43. 'throw'
  44. 'the'
  45. 'dice'
  46. 'and'
  47. 'I'
  48. 'was'
  49. 'right'
  50. 'That'
  51. 'is'
  52. 'evidence'
  53. 'for'
  54. 'my'
  55. 'claim'
  56. 'of'
  57. 'ESP'
  58. 'However'
  59. 'there'
  60. 's'
  61. 'an'
  62. 'alternative'
  63. 'model'
  64. 'just'
  65. 'a'
  66. 'lucky'
  67. 'guess'
  68. 'that'
  69. 'also'
  70. 'explains'
  71. 'it'
  72. 'and'
  73. 'it'
  74. 's'
  75. 'much'
  76. 'more'
  77. 'likely'
  78. 'to'
  79. 'be'
  80. 'the'
  81. 'right'
  82. 'model'
  83. 'because'
  84. 'ESP'
  85. 'needs'
  86. 'much'
  87. 'more'
  88. 'assumptions'
  89. 'many'
  90. 'of'
  91. 'those'
  92. 'in'
  93. 'conflict'
  94. 'with'
  95. 'accepted'
  96. 'facts'
  97. 'and'
  98. 'theories'
  99. 'This'
  100. 'is'
  101. 'a'
  102. 'subject'
  103. 'of'
  104. 'statistical'
  105. 'inference'
  106. 'It'
  107. 's'
  108. 'crucial'
  109. 'to'
  110. 'consider'
  111. 'the'
  112. 'alternatives'
  113. 'when'
  114. 'we'
  115. 'want'
  116. 'to'
  117. 'put'
  118. 'our'
  119. 'beliefs'
  120. 'to'
  121. 'the'
  122. 'test'

In [95]:
# at which token positions does 'dice' occur?
grep("dice", tokens, fixed=TRUE)


  1. 37
  2. 45
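
Two related lookups (an added sketch): grepl returns a logical vector instead of positions, and grep(..., value = TRUE) returns the matching elements themselves.

In [ ]:
grepl("dice", tokens, fixed = TRUE)               # logical, TRUE at 37 and 45
grep("dice", tokens, fixed = TRUE, value = TRUE)  # "dice" "dice"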

In [102]:
string <- "abcedabcfaa"
# goal: split the string into single letters

In [103]:
gsub("([a-z])","\\1,",string)
# next, split this result on ',' => the result is a list


'a,b,c,e,d,a,b,c,f,a,a,'

In [110]:
# strsplit returns a list, hence the [[1]] (double brackets)
strsplit(gsub("([a-z])","\\1,",string),",")


    1. 'a'
    2. 'b'
    3. 'c'
    4. 'e'
    5. 'd'
    6. 'a'
    7. 'b'
    8. 'c'
    9. 'f'
    10. 'a'
    11. 'a'

In [111]:
cs <- strsplit(gsub("([a-z])","\\1,",string),",")[[1]]
cs


  1. 'a'
  2. 'b'
  3. 'c'
  4. 'e'
  5. 'd'
  6. 'a'
  7. 'b'
  8. 'c'
  9. 'f'
  10. 'a'
  11. 'a'

Regexpr



In [113]:
cs <- c("aaa", "axx", "xaa", "axx", "xxx", "xxx")

In [115]:
# regexpr : position of the first match in each element (-1 = no match)
regexpr("a", cs)


  1. 1
  2. 1
  3. 2
  4. 1
  5. -1
  6. -1

In [123]:
# a* : zero or more a's, so every element matches at position 1
test <- regexpr("a*", cs)
test
# the "match.length" attribute gives the length of each match
attr(test, "match.length")


  1. 1
  2. 1
  3. 1
  4. 1
  5. 1
  6. 1
  1. 3
  2. 1
  3. 0
  4. 1
  5. 0
  6. 0
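
regexpr only reports the first match per element; gregexpr reports all of them, one vector per element (a minimal added sketch):

In [ ]:
g <- gregexpr("a+", c("aaa", "axa"))
g[[2]]                        # 1 3 : "axa" contains two runs of a's
attr(g[[2]], "match.length")  # 1 1 : each run has length 1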

In [124]:
cs <- c("123ab67","ab321","10000","0","abc")

In [127]:
# [a-z]*([0-9]+) : optional lowercase letters followed by
# at least one digit
# regexec returns, for each element, the start of the full
# match and the start of each parenthesized group (-1 = no match)
regexec("[a-z]*([0-9]+)",cs)


    1. 1
    2. 1
    1. 1
    2. 3
    1. 1
    2. 1
    1. 1
    2. 1
  1. -1

At the R console the same result prints as a list:

[[1]]
[1] 1 1
attr(,"match.length")
[1] 3 3
attr(,"useBytes")
[1] TRUE

In [131]:
set.seed(101)
# generate 20 mock data strings
pop.data <- paste("the population is", 
                  floor(runif(20,1e3,5e4)),"birds")
head(pop.data)


  1. 'the population is 19237 birds'
  2. 'the population is 3147 birds'
  3. 'the population is 35774 birds'
  4. 'the population is 33226 birds'
  5. 'the population is 13242 birds'
  6. 'the population is 15702 birds'

In [138]:
# for each element: 1 is the start of the full match,
# 19 is the start of the captured group => ([0-9]*) needs the parentheses
reg.info <- regexec("the population is ([0-9]*) birds",
                    pop.data)
reg.info[1:3]


    1. 1
    2. 19
    1. 1
    2. 19
    1. 1
    2. 19

In [140]:
# regmatches extracts the matched text:
# [[i]][1] is the full match, [[i]][2] is the captured group ([0-9]*)
reg.data <- regmatches(pop.data, reg.info)
reg.data[1:3]


    1. 'the population is 19237 birds'
    2. '19237'
    1. 'the population is 3147 birds'
    2. '3147'
    1. 'the population is 35774 birds'
    2. '35774'

In [142]:
# sapply extracts x[2] (the captured number) from each element as a vector
bird.population <- sapply(reg.data, function(x)x[2])
bird.population


  1. '19237'
  2. '3147'
  3. '35774'
  4. '33226'
  5. '13242'
  6. '15702'
  7. '29658'
  8. '17339'
  9. '31478'
  10. '27745'
  11. '44109'
  12. '35636'
  13. '36866'
  14. '46650'
  15. '23300'
  16. '29925'
  17. '41201'
  18. '11981'
  19. '21171'
  20. '2891'
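
The extracted populations are still character strings; arithmetic needs an explicit conversion (a small added follow-up):

In [ ]:
mean(as.numeric(bird.population))  # average over the 20 records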

In [147]:
set.seed(1303)
# random steps for a time series
steps <- sample(-2:2, size=200, 
                prob=c(.1,.2,.2,.4,.1) ,replace=TRUE)
steps
# cumulative sum => a random walk
ts <- cumsum(steps)
plot(ts, type="l")


  1. 1
  2. 0
  3. -2
  4. 0
  5. -2
  6. 2
  7. -1
  8. 1
  9. 0
  10. -1
  11. -1
  12. 0
  13. 0
  14. 0
  15. -1
  16. 1
  17. 1
  18. -2
  19. 1
  20. 1
  21. 0
  22. 1
  23. 1
  24. 1
  25. -1
  26. -2
  27. 0
  28. 0
  29. 0
  30. 2
  31. 1
  32. 0
  33. 1
  34. 1
  35. 2
  36. 1
  37. 1
  38. 1
  39. -1
  40. 1
  41. 1
  42. 0
  43. -1
  44. -1
  45. -1
  46. -2
  47. -1
  48. -2
  49. -1
  50. -2
  51. 1
  52. 0
  53. 1
  54. 1
  55. 1
  56. -1
  57. 0
  58. 1
  59. 1
  60. -1
  61. -2
  62. 1
  63. 0
  64. -1
  65. 2
  66. 1
  67. -1
  68. 2
  69. 0
  70. 0
  71. -2
  72. 0
  73. 1
  74. 1
  75. 1
  76. 0
  77. 1
  78. -1
  79. -2
  80. 2
  81. -1
  82. 1
  83. -1
  84. 1
  85. 1
  86. 1
  87. 0
  88. -1
  89. 2
  90. 1
  91. -2
  92. -2
  93. 0
  94. -2
  95. 0
  96. 2
  97. 0
  98. 1
  99. -2
  100. -1
  101. 2
  102. 0
  103. -1
  104. -2
  105. 0
  106. 1
  107. 1
  108. -1
  109. 0
  110. 1
  111. 1
  112. -1
  113. 1
  114. 0
  115. -1
  116. 2
  117. 0
  118. 1
  119. 1
  120. 1
  121. -1
  122. -2
  123. -1
  124. 1
  125. 1
  126. -2
  127. 2
  128. 1
  129. 1
  130. -1
  131. -2
  132. 1
  133. 1
  134. 1
  135. 2
  136. 0
  137. -1
  138. 1
  139. -1
  140. -2
  141. 1
  142. 2
  143. 1
  144. 0
  145. 0
  146. 1
  147. -1
  148. -2
  149. -2
  150. -1
  151. 1
  152. 1
  153. -1
  154. 2
  155. -2
  156. -1
  157. -2
  158. -1
  159. -1
  160. -2
  161. 1
  162. -2
  163. 1
  164. -1
  165. 2
  166. 0
  167. 0
  168. 2
  169. 0
  170. 1
  171. 1
  172. -1
  173. 1
  174. 0
  175. 2
  176. 0
  177. 1
  178. -2
  179. 2
  180. 2
  181. 0
  182. 0
  183. -2
  184. 1
  185. 0
  186. 1
  187. 0
  188. 1
  189. 0
  190. 1
  191. -1
  192. 1
  193. -1
  194. -1
  195. -2
  196. -2
  197. -2
  198. -1
  199. 0
  200. 1

In [148]:
difs <- sign(diff(ts) >= 0)  # 0 if the series decreased, 1 otherwise

bits <- paste0(difs,collapse="")  # collapse into a string of bits
bits


'1010101100111011011111100111111111111011100000000111110111001101101110111111001010111101100101111001100111011101101111100011011100111110100111111000011010000001010111111101111101111011111110100000011'

In [149]:
# start positions of every run of two or more consecutive 0s
matches <- gregexpr("00+", bits, perl = T)[[1]]
matches


  1. 9
  2. 24
  3. 42
  4. 59
  5. 77
  6. 90
  7. 98
  8. 102
  9. 120
  10. 129
  11. 138
  12. 146
  13. 154
  14. 192

In [151]:
# and the length of each matched run
attributes(matches)$match.length


  1. 2
  2. 2
  3. 8
  4. 2
  5. 2
  6. 2
  7. 2
  8. 2
  9. 3
  10. 2
  11. 2
  12. 4
  13. 6
  14. 6

Extracting these runs of 0s is useful, e.g. for finding stretches where the series kept decreasing for at least a given number of steps (see the sketch below).
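
For example (a minimal added sketch, reusing the bits string from above):

In [ ]:
long.runs <- gregexpr("0{4,}", bits)[[1]]
long.runs                        # start positions of runs of four or more 0s
attr(long.runs, "match.length")  # the length of each run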


In [154]:
plot(ts, type="n")  # empty canvas with the right axes
min.y <- rep(min(ts), length(matches))
max.y <- rep(max(ts), length(matches))
# shade each run of 0s, i.e. each decreasing stretch
rect(matches, min.y, matches + attributes(matches)$match.length, max.y,
     col="lightgrey", border=FALSE)
points(ts, type="l")  # draw the series on top



In [8]:
library(stringr)

str1 <- c("o mapa")
str2 <- c("nao e o territorio")
str3 <- str_c(str1,str2, sep=" ") # same result as paste
str3
str4 <- paste(str1, str2)
str4


'o mapa nao e o territorio'
'o mapa nao e o territorio'

In [13]:
# built-in constants letters / LETTERS
str_c(letters, collapse = ", ") # lowercase
str_c(LETTERS, collapse = ", ") # uppercase


'a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z'
'A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z'

In [14]:
str_length(str3) # counts spaces too


25

In [20]:
str_dup("ab",5)  # repetition within a single string
rep("ab", each = 5) # rep repeats the element instead


'ababababab'
  1. 'ab'
  2. 'ab'
  3. 'ab'
  4. 'ab'
  5. 'ab'

In [21]:
str_dup(c("ab","c"),3)


  1. 'ababab'
  2. 'ccc'

In [22]:
str_dup("ab",1:3)


  1. 'ab'
  2. 'abab'
  3. 'ababab'

In [23]:
str3
str_count(str3, "r")  # number of 'r' characters


3

In [24]:
str_detect(str3, "r")


TRUE
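
Because str_detect is vectorized, it is convenient for filtering (an added sketch; fruits is a throwaway vector):

In [ ]:
fruits <- c("apple", "banana", "pear")
fruits[str_detect(fruits, "an")]  # "banana"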

In [27]:
# [it][eo]+ : 'i' or 't' followed by one or more of 'e'/'o';
# str_extract returns only the first match
str3
str_extract(str3, "[it][eo]+") 
str4 <- "ie to istel"
str_extract(str4, "[it][eo]+")


'o mapa nao e o territorio'
'te'
'ie'

In [28]:
str_extract_all(str3, "[it][eo]+")


    1. 'te'
    2. 'to'
    3. 'io'

In [29]:
str_locate(str3, "[it][eo]+")


  start end
     16  17

In [32]:
str_locate_all(str3, "[it][eo]+")
test <- str_locate_all(str3, "[it][eo]+")
str(test) # the result is a list


  1. start end
        16  17
        21  22
        24  25
List of 1
 $ : int [1:3, 1:2] 16 21 24 17 22 25
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "start" "end"

In [33]:
str_replace(str3,"r","R")      # replace first match
str_replace_all(str3,"r","R")  # replace all matches


'o mapa nao e o teRritorio'
'o mapa nao e o teRRitoRio'

In [34]:
str_split(str3,"e") # split the string on 'e'


    1. 'o mapa nao '
    2. ' o t'
    3. 'rritorio'

In [35]:
str_split(str3,"e",n=2) 
# the argument n limits the number of pieces returned;
# splitting stops once n pieces have been produced


    1. 'o mapa nao '
    2. ' o territorio'

In [36]:
str_sub(str3, 1, 3)


'o m'

In [37]:
str_sub(str3,
        seq(1,24,2), # 1,3,5,7... 23
        seq(2,25,2)) # 2,4,6,8... 24
# extracts the character pairs (1,2), (3,4), ..., (23,24)


  1. 'o '
  2. 'ma'
  3. 'pa'
  4. ' n'
  5. 'ao'
  6. ' e'
  7. ' o'
  8. ' t'
  9. 'er'
  10. 'ri'
  11. 'to'
  12. 'ri'

In [39]:
str4 <- "BBCDEF"
# a substring can be assigned to, replacing characters in place
str_sub(str4, 1, 1) <- "A"
str4


'ABCDEF'

In [47]:
strings <- c(" 219 733 8965", "329-293-8753 ", "banana", "595 794 7569",
  "387 287 6718", "apple", "233.398.9187  ", "482 952 3315",
  "239 923 8115", "842 566 4692", "Work: 579-499-7527", "$1000",
  "Home: 543.355.3679")
phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"
str_extract(strings, phone)

# each parenthesized group captures one part of the number
# (str_match below returns them separately)
# [2-9][0-9]{2} : a digit 2-9 followed by exactly two digits
# [- .]     : a dash, space, or period
# [0-9]{3}  : exactly three digits
# [0-9]{4}  : exactly four digits


  1. '219 733 8965'
  2. '329-293-8753'
  3. NA
  4. '595 794 7569'
  5. '387 287 6718'
  6. NA
  7. '233.398.9187'
  8. '482 952 3315'
  9. '239 923 8115'
  10. '842 566 4692'
  11. '579-499-7527'
  12. NA
  13. '543.355.3679'

In [48]:
str_match(strings, phone)


  full match      group 1  group 2  group 3
  '219 733 8965'  '219'    '733'    '8965'
  '329-293-8753'  '329'    '293'    '8753'
  NA              NA       NA       NA
  '595 794 7569'  '595'    '794'    '7569'
  '387 287 6718'  '387'    '287'    '6718'
  NA              NA       NA       NA
  '233.398.9187'  '233'    '398'    '9187'
  '482 952 3315'  '482'    '952'    '3315'
  '239 923 8115'  '239'    '923'    '8115'
  '842 566 4692'  '842'    '566'    '4692'
  '579-499-7527'  '579'    '499'    '7527'
  NA              NA       NA       NA
  '543.355.3679'  '543'    '355'    '3679'

In [51]:
s1 <- str_pad("hadley", 10, "left")
s2 <- str_pad("hadley", 10, "right")
s3 <- str_pad("hadley", 10, "both")
s1; s2; s3
# str_pad(string, width, side) : pad with spaces up to the given total width


'    hadley'
'hadley    '
'  hadley  '
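
The inverse of padding is trimming (an added sketch): str_trim strips surrounding whitespace, optionally from one side only.

In [ ]:
str_trim("  hadley  ")                 # "hadley"
str_trim("  hadley  ", side = "left")  # "hadley  "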

In [61]:
thanks_path <- file.path(R.home("doc"), "THANKS")
thanks_path

thanks <- str_c(readLines(thanks_path), collapse = "\n")
# read the file and join its lines with \n
thanks <- word(thanks, 1, 3, fixed("\n\n"))
# with "\n\n" as the separator, word() extracts "words" 1 to 3,
# i.e. the first three paragraphs
cat(str_wrap(thanks), "\n")


'C:/R/R-34~1.1/doc/THANKS'
R would not be what it is today without the invaluable help of these people
outside of the R core team, who contributed by donating code, bug fixes and
documentation: Valerio Aimale, Thomas Baier, Henrik Bengtsson, Roger Bivand,
Ben Bolker, David Brahm, G"oran Brostr"om, Patrick Burns, Vince Carey, Saikat
DebRoy, Matt Dowle, Brian D'Urso, Lyndon Drake, Dirk Eddelbuettel, Claus
Ekstrom, Sebastian Fischmeister, John Fox, Paul Gilbert, Yu Gong, Gabor
Grothendieck, Frank E Harrell Jr, Torsten Hothorn, Robert King, Kjetil Kjernsmo,
Roger Koenker, Philippe Lambert, Jan de Leeuw, Jim Lindsey, Patrick Lindsey,
Catherine Loader, Gordon Maclean, John Maindonald, David Meyer, Ei-ji Nakama,
Jens Oehlschaegel, Steve Oncley, Richard O'Keefe, Hubert Palme, Roger D. Peng,
Jose' C. Pinheiro, Tony Plate, Anthony Rossini, Jonathan Rougier, Petr Savicky,
Guenther Sawitzki, Marc Schwartz, Arun Srinivasan, Detlef Steuer, Bill Simpson,
Gordon Smyth, Adrian Trapletti, Terry Therneau, Rolf Turner, Bill Venables,
Gregory R. Warnes, Andreas Weingessel, Morten Welinder, James Wettenhall, Simon
Wood, and Achim Zeileis. Others have written code that has been adopted by R and
is acknowledged in the code files, including 

In [64]:
cat(str_wrap(thanks, width = 40), "\n")
# limit the line width to 40 characters


R would not be what it is today
without the invaluable help of these
people outside of the R core team,
who contributed by donating code,
bug fixes and documentation: Valerio
Aimale, Thomas Baier, Henrik Bengtsson,
Roger Bivand, Ben Bolker, David Brahm,
G"oran Brostr"om, Patrick Burns, Vince
Carey, Saikat DebRoy, Matt Dowle,
Brian D'Urso, Lyndon Drake, Dirk
Eddelbuettel, Claus Ekstrom, Sebastian
Fischmeister, John Fox, Paul Gilbert,
Yu Gong, Gabor Grothendieck, Frank E
Harrell Jr, Torsten Hothorn, Robert
King, Kjetil Kjernsmo, Roger Koenker,
Philippe Lambert, Jan de Leeuw, Jim
Lindsey, Patrick Lindsey, Catherine
Loader, Gordon Maclean, John Maindonald,
David Meyer, Ei-ji Nakama, Jens
Oehlschaegel, Steve Oncley, Richard
O'Keefe, Hubert Palme, Roger D. Peng,
Jose' C. Pinheiro, Tony Plate, Anthony
Rossini, Jonathan Rougier, Petr Savicky,
Guenther Sawitzki, Marc Schwartz, Arun
Srinivasan, Detlef Steuer, Bill Simpson,
Gordon Smyth, Adrian Trapletti, Terry
Therneau, Rolf Turner, Bill Venables,
Gregory R. Warnes, Andreas Weingessel,
Morten Welinder, James Wettenhall, Simon
Wood, and Achim Zeileis. Others have
written code that has been adopted by R
and is acknowledged in the code files,
including 

In [69]:
cat(str_wrap(thanks, width = 40, indent =2), "\n")
# indent : indents the first line; exdent indents all following lines instead


  R would not be what it is today
without the invaluable help of these
people outside of the R core team,
who contributed by donating code,
bug fixes and documentation: Valerio
Aimale, Thomas Baier, Henrik Bengtsson,
Roger Bivand, Ben Bolker, David Brahm,
G"oran Brostr"om, Patrick Burns, Vince
Carey, Saikat DebRoy, Matt Dowle,
Brian D'Urso, Lyndon Drake, Dirk
Eddelbuettel, Claus Ekstrom, Sebastian
Fischmeister, John Fox, Paul Gilbert,
Yu Gong, Gabor Grothendieck, Frank E
Harrell Jr, Torsten Hothorn, Robert
King, Kjetil Kjernsmo, Roger Koenker,
Philippe Lambert, Jan de Leeuw, Jim
Lindsey, Patrick Lindsey, Catherine
Loader, Gordon Maclean, John Maindonald,
David Meyer, Ei-ji Nakama, Jens
Oehlschaegel, Steve Oncley, Richard
O'Keefe, Hubert Palme, Roger D. Peng,
Jose' C. Pinheiro, Tony Plate, Anthony
Rossini, Jonathan Rougier, Petr Savicky,
Guenther Sawitzki, Marc Schwartz, Arun
Srinivasan, Detlef Steuer, Bill Simpson,
Gordon Smyth, Adrian Trapletti, Terry
Therneau, Rolf Turner, Bill Venables,
Gregory R. Warnes, Andreas Weingessel,
Morten Welinder, James Wettenhall, Simon
Wood, and Achim Zeileis. Others have
written code that has been adopted by R
and is acknowledged in the code files,
including 

In [70]:
sentences <- c("Jane saw a cat", "Jane sat down")
word(sentences, 1)  
word(sentences, 2)
word(sentences, -1)
word(sentences, 2, -1)
word(sentences[1], 1:3, -1)


  1. 'Jane'
  2. 'Jane'
  1. 'saw'
  2. 'sat'
  1. 'cat'
  2. 'down'
  1. 'saw a cat'
  2. 'sat down'
  1. 'Jane saw a cat'
  2. 'saw a cat'
  3. 'a cat'

In [73]:
str <- 'abc.def..123.4568.999'
word(str, 1, sep = fixed('..'))
word(str, 2, sep = fixed('..'))


'abc.def'
'123.4568.999'